User Guide

OCR Xpress™ for Java is a powerful full-page Optical Character Recognition (OCR) product. The OCR engine is based on a C API. The OCR Xpress for Java SDK can be used as a stand-alone OCR engine or in conjunction with other Accusoft products like ImageGear for C/C++.

The OCR Xpress for Java toolkit provides support in the following areas:

Full page stand-alone OCR (no pre-filtering or pre-processing of the image is required) of non-compressed BMP files.
Export to TXT file.
Export to searchable PDF file (single-page or multi-page).
Access to hierarchically structured OCR results for advanced post-processing.

Full Page Stand-Alone OCR of Non-Compressed BMP Files

OCR Xpress for Java supports the creation of searchable documents from uncompressed BMP files for distribution to end users. Any Java BufferedImage in an uncompressed 1-BPP, 8-BPP, or 24-BPP format can be loaded and processed without any image pre-filtering or pre-processing. Searchable documents can be produced in text or text-plus-image formats.
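
The bit-depth requirement above can be checked before an image is handed to any OCR step. The following is a minimal sketch that uses only the standard Java library (javax.imageio and java.awt.image); it does not call the OCR Xpress for Java API, and the class and method names are illustrative assumptions. It inspects the decoded in-memory image only, so it cannot tell whether the file on disk was stored compressed.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class BmpDepthCheck {

    /** Returns true when the in-memory image uses 1, 8, or 24 bits per pixel. */
    public static boolean hasSupportedDepth(BufferedImage image) {
        int bitsPerPixel = image.getColorModel().getPixelSize();
        return bitsPerPixel == 1 || bitsPerPixel == 8 || bitsPerPixel == 24;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BmpDepthCheck <file.bmp>");
            return;
        }
        // ImageIO.read returns null when no registered reader can decode the file.
        BufferedImage image = ImageIO.read(new File(args[0]));
        if (image == null) {
            System.err.println("No ImageIO reader could decode " + args[0]);
            return;
        }
        System.out.println(args[0] + ": "
                + image.getColorModel().getPixelSize() + " bits per pixel, "
                + "supported depth: " + hasSupportedDepth(image));
    }
}
```

An image that fails this check, such as a 32-BPP image with an alpha channel, would need to be converted to one of the listed depths first.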

The OCR Xpress for Java SDK provides document recognition technology for extracting text from document images. It is an omni-font text recognition component that supports multiple output file formats, including text and PDF. It can also output structured results for detailed examination of the recognition output, including the area and confidence of the recognized character values.

Export to Searchable PDF File

OCR Xpress for Java converts BMP-formatted images into searchable PDF documents. One or more BMP images can be built into a single PDF document.

In addition, OCR Xpress for Java provides a rich API that allows the customer to access the same internal OCR results used to generate the PDF documents.

Export to TXT File

OCR Xpress for Java can also convert an image to a TXT file for archiving searchable text. By archiving the original image with the searchable text file in a database, the image can later be retrieved according to the results of searches for key words or phrases in the text file.
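
The retrieval idea described above can be sketched as follows. This is an illustration only: an in-memory map stands in for the database, the image names and text are made-up sample values, and the class is hypothetical rather than part of OCR Xpress for Java.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class TextArchiveSearch {

    // Maps an archived image name to the text exported for that image.
    private final Map<String, String> textByImage = new LinkedHashMap<>();

    /** Stores the exported text alongside the name of its source image. */
    public void archive(String imageName, String ocrText) {
        textByImage.put(imageName, ocrText);
    }

    /** Returns the names of all images whose text contains the phrase, ignoring case. */
    public List<String> imagesContaining(String phrase) {
        String needle = phrase.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> entry : textByImage.entrySet()) {
            if (entry.getValue().toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        TextArchiveSearch archive = new TextArchiveSearch();
        archive.archive("scan-001.bmp", "Invoice number 12345, due in 30 days");
        archive.archive("scan-002.bmp", "Meeting notes for the project review");
        System.out.println(archive.imagesContaining("invoice"));
    }
}
```

A production archive would typically use a database with full-text indexing instead of a linear scan.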

Access to Hierarchically Structured OCR Results for Advanced Post-Processing

For applications that need to access post-OCR data for processing purposes, OCR Xpress for Java generates and maintains an internal hierarchical model of the text it finds in an image. Every character is hierarchically tied to the word, text line, text block, region, and page with which it is associated. The same is true of every word, text line, text block, region, and page of the generated document. The rich API allows the application to access this internal hierarchical model. With OCR Xpress for Java, a form reader application can extract data from a form based on its location. The API also provides confidence levels for the text in question so that the application can make content usage decisions based on the confidence that the recognized text is correct. The complete internal hierarchical model can even be copied into an application's local workspace to allow the application to perform higher levels of segmentation, content association, and analysis of the text data.
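
To illustrate how an application might work with a local copy of such a model, the sketch below defines a small tree with the levels named above and selects words by location and confidence, as a form reader might. Everything in it is an assumption made for illustration: the Node type, its fields, the 0.0 to 1.0 confidence scale, and the sample data are hypothetical and are not the OCR Xpress for Java API.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.List;

public class OcrTreeSketch {

    /** Levels named in the text above, outermost first. */
    public enum Level { PAGE, REGION, TEXT_BLOCK, TEXT_LINE, WORD, CHARACTER }

    /** Hypothetical node type; NOT part of the OCR Xpress for Java API. */
    public static final class Node {
        public final Level level;
        public final String text;        // recognized text covered by this node
        public final Rectangle area;     // bounding box in image pixels
        public final double confidence;  // assumed scale: 0.0 (lowest) to 1.0 (highest)
        public final List<Node> children = new ArrayList<>();

        public Node(Level level, String text, Rectangle area, double confidence) {
            this.level = level;
            this.text = text;
            this.area = area;
            this.confidence = confidence;
        }

        /** Adds a child and returns this node so calls can be chained. */
        public Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Collects the text of nodes at the wanted level that lie fully inside the zone and meet the threshold. */
    public static List<String> textInZone(Node root, Level wanted, Rectangle zone, double minConfidence) {
        List<String> hits = new ArrayList<>();
        collect(root, wanted, zone, minConfidence, hits);
        return hits;
    }

    private static void collect(Node node, Level wanted, Rectangle zone, double minConfidence, List<String> hits) {
        if (node.level == wanted) {
            if (zone.contains(node.area) && node.confidence >= minConfidence) {
                hits.add(node.text);
            }
            return; // nothing below the wanted level needs to be visited
        }
        for (Node child : node.children) {
            collect(child, wanted, zone, minConfidence, hits);
        }
    }

    /** Builds a made-up one-line page; characters are omitted to keep the sample short. */
    public static Node samplePage() {
        Node line = new Node(Level.TEXT_LINE, "Invoice N0. 12345", new Rectangle(50, 40, 260, 20), 0.77)
                .add(new Node(Level.WORD, "Invoice", new Rectangle(50, 40, 120, 20), 0.98))
                .add(new Node(Level.WORD, "N0.", new Rectangle(180, 40, 40, 20), 0.41))
                .add(new Node(Level.WORD, "12345", new Rectangle(230, 40, 80, 20), 0.93));
        Node block = new Node(Level.TEXT_BLOCK, line.text, new Rectangle(50, 40, 260, 20), 0.77).add(line);
        Node region = new Node(Level.REGION, line.text, new Rectangle(40, 30, 300, 60), 0.77).add(block);
        return new Node(Level.PAGE, line.text, new Rectangle(0, 0, 600, 800), 0.77).add(region);
    }

    public static void main(String[] args) {
        Rectangle headerZone = new Rectangle(0, 0, 600, 100);
        System.out.println(textInZone(samplePage(), Level.WORD, headerZone, 0.9));
    }
}
```

In this sample the low-confidence word is filtered out, which is the kind of content usage decision the confidence levels make possible.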

Limitations of OCR Xpress

Even though OCR Xpress for Java is designed to be a stand-alone OCR engine, there are some pre-OCR image processing operations that may need to be performed on the input image to achieve optimal results.

OCR Xpress for Java:

does not deskew the input image.
accepts only uncompressed BMP files.

Accusoft offers other products, such as ImageGear Professional, as companion components for a document recognition application. If your application requires these pre-OCR image processing operations, Accusoft recommends using the Accusoft components that provide this functionality.
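
For the file-format limitation alone, a simple conversion can also be sketched with the standard Java library. The example below is not an Accusoft API and does not deskew; it assumes that the source format is one the installed ImageIO readers can decode and that the standard ImageIO BMP writer produces uncompressed output with its default settings, which should be verified on the target Java runtime. The class and method names are illustrative.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class ToUncompressedBmp {

    /** Copies any image into a 24-BPP image, flattening transparency onto white. */
    public static BufferedImage toRgb24(BufferedImage source) {
        BufferedImage rgb = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_3BYTE_BGR);
        Graphics2D g = rgb.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, rgb.getWidth(), rgb.getHeight());
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return rgb;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java ToUncompressedBmp <input-image> <output.bmp>");
            return;
        }
        BufferedImage source = ImageIO.read(new File(args[0]));
        if (source == null) {
            System.err.println("No ImageIO reader could decode " + args[0]);
            return;
        }
        // Assumption: the default BMP writer settings produce uncompressed data.
        boolean written = ImageIO.write(toRgb24(source), "bmp", new File(args[1]));
        System.out.println(written ? "Wrote " + args[1] : "No BMP writer available");
    }
}
```

Skewed scans still need a separate deskew step before recognition.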

For information on how to register and license all your Accusoft components, see Licensing.